library(tidyverse) # for data import, manipulation, viz
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
[30m── [1mAttaching packages[22m ─────────────────────────────────── tidyverse 1.3.0 ──[39m
[30m[32m✓[30m [34mggplot2[30m 3.3.0 [32m✓[30m [34mpurrr [30m 0.3.3
[32m✓[30m [34mtibble [30m 2.1.3 [32m✓[30m [34mdplyr [30m 0.8.5
[32m✓[30m [34mtidyr [30m 1.0.2 [32m✓[30m [34mstringr[30m 1.4.0
[32m✓[30m [34mreadr [30m 1.3.1 [32m✓[30m [34mforcats[30m 0.5.0[39m
[30m── [1mConflicts[22m ────────────────────────────────────── tidyverse_conflicts() ──
[31mx[30m [34mdplyr[30m::[32mfilter()[30m masks [34mstats[30m::filter()
[31mx[30m [34mdplyr[30m::[32mlag()[30m masks [34mstats[30m::lag()[39m
load("~/Documents/GitHub/Berkeley_DS/w203_stats/w203_project3/lab_3-master/data_clean.rda")
filter <- dplyr::filter
library(car)
Loading required package: carData
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘car’
The following object is masked from ‘package:dplyr’:
recode
The following object is masked from ‘package:purrr’:
some
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
library(stargazer)
Please cite as:
Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(sandwich)
crime_df = data2
str(crime_df$crmrte)
num [1:90] 0.0356 0.0153 0.013 0.0268 0.0106 ...
summary(crime_df %>% select(crmrte))
crmrte
Min. :0.005533
1st Qu.:0.020604
Median :0.030002
Mean :0.033510
3rd Qu.:0.040249
Max. :0.098966
ggplot(crime_df, aes(x=crmrte)) +
geom_histogram(bins = 20)
This variable follows a quasi-normal distribution with a slight skew to the left.
qqnorm(crime_df$crmrte)
qqline(crime_df$crmrte)
The Q-Q plot shows that the crmrte variable is not normal. There is heavy positive skewness.
crime_df = crime_df %>% mutate(pris_crime_rt = prbarr * prbconv * prbpris)
justice_sys_df = crime_df %>%
select(crmrte, prbarr, prbconv, prbpris, avgsen, pris_crime_rt)
print("Summary Statistics of Criminal Justice System Variables")
[1] "Summary Statistics of Criminal Justice System Variables"
summary(justice_sys_df)
crmrte prbarr prbconv prbpris avgsen pris_crime_rt
Min. :0.005533 Min. :0.09277 Min. :0.06838 Min. :0.1500 Min. : 5.380 Min. :0.01064
1st Qu.:0.020604 1st Qu.:0.20495 1st Qu.:0.34422 1st Qu.:0.3642 1st Qu.: 7.375 1st Qu.:0.03452
Median :0.030002 Median :0.27146 Median :0.45170 Median :0.4222 Median : 9.110 Median :0.05267
Mean :0.033510 Mean :0.29524 Mean :0.55086 Mean :0.4106 Mean : 9.689 Mean :0.06668
3rd Qu.:0.040249 3rd Qu.:0.34487 3rd Qu.:0.58513 3rd Qu.:0.4576 3rd Qu.:11.465 3rd Qu.:0.07301
Max. :0.098966 Max. :1.09091 Max. :2.12121 Max. :0.6000 Max. :20.700 Max. :0.81818
paste("Number of Observations:", length(justice_sys_df$prbarr))
[1] "Number of Observations: 90"
ggplot(justice_sys_df, aes(x = prbarr)) +
geom_histogram(bins = 30)
There seems to be some outliers after 0.6. There is one data point over the value of 1, which is the most suspicious.
qqnorm(justice_sys_df$prbarr)
qqline(justice_sys_df$prbarr)
The probability of arrest’s ditribution also approximates a normal distribution. Mostly positive skeewness and one outlier affects the normality of the variable’s sample distribution.
paste("Number of Observations:", length(justice_sys_df$prbconv))
[1] "Number of Observations: 90"
ggplot(justice_sys_df, aes(x = prbconv)) +
geom_histogram(bins = 10)
There seems to be some outliers around 2. There distribution is positively skewed.
qqnorm(prbconv_df$prbconv)
qqline(prbconv_df$prbconv)
The probability of conversion has some heavy positive skewness and some outliers to the right.
print(paste("Number of Observations:", length(justice_sys_df$prbpris)))
[1] "Number of Observations: 90"
ggplot(justice_sys_df, aes(x = prbpris)) +
geom_histogram()
qqnorm(prbpris_df$prbpris)
qqline(prbpris_df$prbpris)
print(paste("Number of Observations:", length(justice_sys_df$avgsen)))
[1] "Number of Observations: 90"
We seem to have some outliers after 20. The distribution of the variable doesn’t seem very normal.
qqnorm(justice_sys_df$avgsen)
qqline(justice_sys_df$avgsen)
Fairly normally distributed but the aglomeration on the left side and some outliers on the right make the sample distribution less normal.
ggplot(justice_sys_df, aes(x = avgsen)) +
geom_histogram(bins = 40)
ggplot(justice_sys_df, aes(x = prbarr * prbconv * prbpris)) +
geom_histogram(bins = 50)
Correlation matrix for potential objective variable.
cor(justice_sys_df, method = "spearman") # using spearman's correlation because the variables don't follow a normal distribution
crmrte prbarr prbconv prbpris avgsen pris_crime_rt
crmrte 1.000000000 -0.35157427 -0.34660246 0.001218277 0.08015870 -0.5160719
prbarr -0.351574268 1.00000000 -0.19942380 0.014890972 -0.08460373 0.3605630
prbconv -0.346602461 -0.19942380 1.00000000 0.053546587 -0.01754964 0.7295633
prbpris 0.001218277 0.01489097 0.05354659 1.000000000 -0.17037915 0.3194439
avgsen 0.080158704 -0.08460373 -0.01754964 -0.170379151 1.00000000 -0.1335731
pris_crime_rt -0.516071943 0.36056303 0.72956332 0.319443873 -0.13357315 1.0000000
prbarr: Low correlation with prbconv and prbpris and avgsen. prbconv: Low correlation with prbprisand avgsen. prbpris: Low correlation with avgsen. avgsen: Low correlation with all. pris_crime_rt: High correlation with prbarr and prbconv and prbpris.
All the Criminal Justice System variables have low correlation with each other, which gives us confident that we can include them all in a model and that doing so will not increase the variance of our coefficients.
ggplot(crime_df, aes(x=prbarr, y=crmrte)) +
geom_point()+
geom_smooth(method=lm, se=FALSE) +
ggtitle("Scatterplot of Crime Rate vs Probability of Arrest")
Clear negative relationship. The higher the Probability of Arrest, the lower the Crime Rate.
ggplot(crime_df, aes(x=prbconv, y=crmrte)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
Negative realtionship although there is some positive data skewness.
ggplot(crime_df, aes(x=prbpris, y=crmrte)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
There is no strong correlation between Probability of Prision and Crime Rate. Test linear model with and without Probabiity of Prision.
ggplot(crime_df, aes(x=avgsen, y=crmrte)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
There is no strong correlation between Probability of Prision and Crime Rate. Test linear model with and without avgsen.
ggplot(crime_df, aes(x= (prbarr* prbconv * prbpris), y=crmrte)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
justice_sys_1_md = lm(crmrte ~ prbarr + prbconv, data = justice_sys_data_df)
justice_sys_1_md$coefficients
(Intercept) prbarr prbconv
0.06246762 -0.05736077 -0.02182518
plot(justice_sys_1_md)
Zero Conditional Mean: The residuals don’t seem to have a zero conditional mean. This is not a major issue because we have large sample size so we can invoke the CLT to have more confidence on the coefficients.
Normality of Errors: The residuals have a fairly normal distirbution with the exception of some positve skewness.
Heteroskadiscity: Complete violation.
Outliers: There is one observation with high leverage and influence.
Next Steps: In order to address the Zero Conditional Mean and Heteroskasdiscity issues, we will ;og-transform crmrte, prbarr, and prbconv. Those variables had a positively skewed distribution so a log transform could help. Note that all variables have values greater than zero so we can log-transform them without issues.
justice_sys_2_md = lm(log(crmrte) ~ log(prbarr) + log(prbconv), data = justice_sys_df)
justice_sys_2_md$coefficients
(Intercept) log(prbarr) log(prbconv)
-4.8450828 -0.7260022 -0.4716933
plot(justice_sys_2_md)
Zero Conditional Mean: Although the residuals don’t seem to have a perfect zero conditional mean, the residuals have gotten much closer to zero conditional mean.
Normality of Errors: The residuals have a fairly normal distirbution with the exception of some positve skewness.
Heteroskadiscity: Complete violation. We would need to use standard errors with White’s adjustment to assess signficance of t-values.
Outliers: There is one observation with high leverage but not high influence.
Next Steps: Heteroskadiscity could be a sign of omitted variables. We will add the remainder
justice_sys_3_md = lm(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris + avgsen, data = justice_sys_df)
justice_sys_3_md$coefficients
(Intercept) log(prbarr) log(prbconv) prbpris avgsen
-4.938924985 -0.727096099 -0.472492046 0.179215398 0.001880653
plot(justice_sys_3_md)
Zero Conditional Mean: Although the residuals don’t seem to have a perfect zero conditional mean, the residuals have gotten much closer to zero conditional mean.
Normality of Errors: The residuals have a fairly normal distirbution with the exception of some positve skewness.
Heteroskadiscity: Complete violation. We would need to use standard errors with White’s adjustment to assess signficance of t-values.
Outliers: There is one observation with high leverage but not high influence.
Next Steps: There was no impact on adding the prbpris and avgsen. We will exclude those from the model for now.
justice_sys_4_md = lm(log(crmrte) ~ log(pris_crime_rt), data = justice_sys_df)
justice_sys_4_md$coefficients
(Intercept) log(pris_crime_rt)
-4.9372291 -0.4695287
plot(justice_sys_4_md)
Zero Conditional Mean: Although the residuals don’t seem to have a perfect zero conditional mean, the residuals have gotten much closer to zero conditional mean.
Normality of Errors: The residuals have a fairly normal distirbution with the exception of some positve skewness.
Heteroskadiscity: Complete violation. We would need to use standard errors with White’s adjustment to assess signficance of t-values.
Outliers: There is one observation with high leverage but not high influence.
justice_police_df = crime_df %>% select(crmrte, prbarr, prbconv, prbpris, avgsen, polpc, pris_crime_rt)
summary(justice_police_df)
crmrte prbarr prbconv prbpris avgsen polpc pris_crime_rt
Min. :0.005533 Min. :0.09277 Min. :0.06838 Min. :0.1500 Min. : 5.380 Min. :0.0007459 Min. :0.01064
1st Qu.:0.020604 1st Qu.:0.20495 1st Qu.:0.34422 1st Qu.:0.3642 1st Qu.: 7.375 1st Qu.:0.0012378 1st Qu.:0.03452
Median :0.030002 Median :0.27146 Median :0.45170 Median :0.4222 Median : 9.110 Median :0.0014897 Median :0.05267
Mean :0.033510 Mean :0.29524 Mean :0.55086 Mean :0.4106 Mean : 9.689 Mean :0.0017080 Mean :0.06668
3rd Qu.:0.040249 3rd Qu.:0.34487 3rd Qu.:0.58513 3rd Qu.:0.4576 3rd Qu.:11.465 3rd Qu.:0.0018856 3rd Qu.:0.07301
Max. :0.098966 Max. :1.09091 Max. :2.12121 Max. :0.6000 Max. :20.700 Max. :0.0090543 Max. :0.81818
ggplot(justice_police_df, aes(x = polpc)) +
geom_histogram()
qqnorm(justice_police_data_df$polpc)
qqline(justice_police_data_df$polpc)
Strong positive skewness and outlier.
cor(justice_police_df, method = "spearman")
crmrte prbarr prbconv prbpris avgsen polpc pris_crime_rt
crmrte 1.000000000 -0.35157427 -0.34660246 0.0012182775 0.08015870 0.5258015393 -0.5160719
prbarr -0.351574268 1.00000000 -0.19942380 0.0148909724 -0.08460373 -0.1598304317 0.3605630
prbconv -0.346602461 -0.19942380 1.00000000 0.0535465868 -0.01754964 -0.3134131786 0.7295633
prbpris 0.001218277 0.01489097 0.05354659 1.0000000000 -0.17037915 -0.0005762123 0.3194439
avgsen 0.080158704 -0.08460373 -0.01754964 -0.1703791508 1.00000000 0.2648990814 -0.1335731
polpc 0.525801539 -0.15983043 -0.31341318 -0.0005762123 0.26489908 1.0000000000 -0.4090793
pris_crime_rt -0.516071943 0.36056303 0.72956332 0.3194438728 -0.13357315 -0.4090793102 1.0000000
polpc: medium positive correlation with prbconv and avgsen and pris_crime_rt.
Note that polpc has high correlation with pris_crime_rt so it could impact the efficiency of our coefficient estimation if both variables are included in the model.
We log-transform the polpc variable because we had noticed that doing so for positively-skewed variable reduced residuals.
ggplot(crime_df, aes(x=log(polpc), y=log(crmrte))) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
justice_police_md = lm(log(crmrte) ~ log(prbarr) + log(prbconv) + log(polpc), data = justice_police_df)
justice_police_md$coefficients
(Intercept) log(prbarr) log(prbconv) log(polpc)
-2.4471800 -0.7353714 -0.4387689 0.3693199
plot(justice_police_md)
Zero Conditional Mean: No noticible improvement compared to \(crimeRate = \beta_0 + \beta_1 * prisCrimeRt + \hat{u}\)
Normality of Errors: The residuals have a fairly normal distirbution with the exception of some positve skewness.
Heteroskadiscity: Although the residuals are still not heteroskadistic, there has been some improvement.
Outliers: There is one observation with high leverage but not high influence.
Next Steps: Heteroskadiscity could be a sign of omitted variables.
m2 <- lm(log(crmrte) ~ log(eff_cj), data = data2);